{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# WebBaseLoader\n",
    "\n",
    "- Author: [Kane](https://github.com/HarryKane11)\n",
    "- Design: [Kane](https://github.com/HarryKane11)\n",
    "- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)\n",
    "- Author: [Yejin Park](https://github.com/ppakyeah)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/03-WebBaseLoader.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/03-WebBaseLoader.ipynb)\n",
    "## Overview\n",
    "\n",
    "```WebBaseLoader``` is a specialized document loader in LangChain designed for processing web-based content. \n",
    "\n",
    "It leverages the ```BeautifulSoup4``` library to parse web pages effectively, offering customizable parsing options through ```SoupStrainer``` and additional ```bs4``` parameters.\n",
    "\n",
    "This tutorial demonstrates how to use ```WebBaseLoader``` to:\n",
    "1. Load and parse web documents effectively\n",
    "2. Customize parsing behavior using ```BeautifulSoup``` options\n",
    "3. Handle different web content structures flexibly.\n",
    "\n",
    "### Table of Contents \n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Load Web-based documents](#load-web-based-documents)\n",
    "- [Load Multiple URLs Concurrently with alazy_load](#load-multiple-urls-concurrently-with-alazy_load)\n",
    "- [Load XML Documents](#load-xml-documents)\n",
    "- [Load Web based document Using Proxies](#load-web-based-document-using-proxies)\n",
    "- [Simple Web Content Loading with MarkItDown](#simple-web-content-loading-with-markitdown)\n",
    "\n",
    "### References\n",
    "\n",
    "- [WebBaseLoader API Documentation](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html)\n",
    "- [BeautifulSoup4 Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial markitdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain_community\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Web-based documents\n",
    "\n",
    "```WebBaseLoader``` is a loader designed for loading web-based documents.\n",
    "\n",
    "It uses the ```bs4``` library to parse web pages.\n",
    "\n",
    "Key Features:\n",
    "- Uses ```bs4.SoupStrainer``` to specify elements to parse.\n",
    "- Accepts additional arguments for ```bs4.SoupStrainer``` through the ```bs_kwargs``` parameter.\n",
    "\n",
    "For more details, refer to the API documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of documents: 1\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\\n\\n\\n\\n\\n\\n\\n\\n\\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\\n')"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import bs4\n",
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "\n",
    "# Load news article content using WebBaseLoader\n",
    "loader = WebBaseLoader(\n",
    "   web_paths=(\"https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/\",),\n",
    "   # Configure BeautifulSoup to parse only specific div elements\n",
    "   bs_kwargs=dict(\n",
    "       parse_only=bs4.SoupStrainer(\n",
    "           \"div\",\n",
    "           attrs={\"class\": [\"entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained\"]},\n",
    "       )\n",
    "   ),\n",
    "   # Set user agent in request header to mimic browser\n",
    "   header_template={\n",
    "       \"User_Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36\",\n",
    "   },\n",
    ")\n",
    "\n",
    "# Load and process the documents\n",
    "docs = loader.load()\n",
    "print(f\"Number of documents: {len(docs)}\")\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To bypass SSL authentication errors, you can set the ```“verify”``` option."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\gram\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\langchain-opentutorial-RXtDr8w5-py3.11\\Lib\\site-packages\\urllib3\\connectionpool.py:1097: InsecureRequestWarning: Unverified HTTPS request is being made to host 'techcrunch.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\\n\\n\\n\\n\\n\\n\\n\\n\\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\\n')"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Bypass SSL certificate verification\n",
    "loader.requests_kwargs = {\"verify\": False}\n",
    "\n",
    "# Load documents from the web\n",
    "docs = loader.load()\n",
    "docs[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also load multiple webpages at once. To do this, you can pass a list of **urls** to the loader, which will return a list of documents in the order of the **urls** passed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    }
   ],
   "source": [
    "# Initialize the WebBaseLoader with web page paths and parsing configurations\n",
    "loader = WebBaseLoader(\n",
    "    web_paths=[\n",
    "        # List of web pages to load\n",
    "        \"https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/\",\n",
    "        \"https://techcrunch.com/2024/12/29/ai-data-centers-could-be-distorting-the-us-power-grid/\",\n",
    "    ],\n",
    "    bs_kwargs=dict(\n",
    "        # BeautifulSoup settings to parse only the specific content section\n",
    "        parse_only=bs4.SoupStrainer(\n",
    "            \"div\",\n",
    "            attrs={\"class\": [\"entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained\"]},\n",
    "        )\n",
    "    ),\n",
    "    header_template={\n",
    "        # Custom HTTP headers for the request (e.g., User-Agent for simulating a browser)\n",
    "        \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36\",\n",
    "    },\n",
    ")\n",
    "\n",
    "# Load the data from the specified web pages\n",
    "docs = loader.load()\n",
    "\n",
    "# Check and print the number of documents loaded\n",
    "print(len(docs))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Output the results fetched from the web."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "We are at the dawn of a new space age.  If you doubt, simply look back at the last year: From SpaceX’s historic catch of the Super Heavy booster to the record-breaking number of lunar landing attempts, this year was full of historic and ambitious missions and demonstrations. \n",
      "We’re taking a look back at the five most significant moments or trends in the space industry this year. Naysayers might think SpaceX is overrepresented on this list, but that just shows how far ahead the space behemoth is\n",
      "==============================\n",
      "\n",
      "The proliferation of data centers aiming to meet the computational needs of AI could be bad news for the US power grid, according to a new report in Bloomberg.\n",
      "Using the 1 million residential sensors tracked by Whisker Labs, along with market intelligence data from DC Byte, Bloomberg found that more than half of the households showing the worst power distortions live within 20 miles of significant data center activity.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "In other words, there appears to be a link between data center proxi\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content[:500])\n",
    "print(\"===\" * 10)\n",
    "print(docs[1].page_content[:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Multiple URLs Concurrently with ```alazy_load()```\n",
    "\n",
    "You can speed up the process of scraping and parsing multiple URLs by using asynchronous loading. This allows you to fetch documents concurrently, improving efficiency while adhering to rate limits.\n",
    "\n",
    "### Key Points:\n",
    "- **Rate Limit** : The ```requests_per_second``` parameter controls how many requests are made per second. In this example, it's set to 1 to avoid overloading the server.\n",
    "- **Asynchronous Loading** : The ```alazy_load()``` function is used to load documents asynchronously, enabling faster processing of multiple URLs.\n",
    "- **Jupyter Notebook Compatibility** : If running in Jupyter Notebook, ```nest_asyncio``` is required to handle asynchronous tasks properly.\n",
    "\n",
    "The code below demonstrates how to configure and load documents asynchronously:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only for jupyter notebook (asyncio)\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the requests per second rate limit\n",
    "loader.requests_per_second = 1\n",
    "\n",
    "# Load documents asynchronously\n",
    "# The aload() is deprecated and alazy_load() is used since the langchain 3.14 update)\n",
    "docs=[]\n",
    "async for doc in loader.alazy_load():\n",
    "    docs.append(doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\\nCEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\\n\\n\\n\\n\\n\\n\\n\\n\\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we have some work to do in 2025 to close the gap and establish a leadership position there as well.”\\n“Scaling Gemini on the consumer side will be our biggest focus next year,” he said.\\n')]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display loaded documents\n",
    "docs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load XML Documents\n",
    "\n",
    "```WebBaseLoader``` can process XML files by specifying a different ```BeautifulSoup``` parser. This is particularly useful when working with structured XML content like sitemaps or government data.\n",
    "\n",
    "### Basic XML Loading\n",
    "\n",
    "The following example demonstrates loading an XML document from a government website:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import WebBaseLoader\n",
    "\n",
    "# Initialize loader with XML document URL\n",
    "loader = WebBaseLoader(\n",
    "    \"https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml\"\n",
    ")\n",
    "\n",
    "# Set parser to XML mode\n",
    "loader.default_parser = \"xml\"\n",
    "\n",
    "# Load and process the document\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Memory-Efficient Loading\n",
    "\n",
    "For handling large documents, ```WebBaseLoader``` provides two memory-efficient loading methods:\n",
    "\n",
    "1. lazy_load() - loads one page at a time\n",
    "2. alazy_load() - asynchronous page loading for better performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "10\n",
      "Energy\n",
      "3\n",
      "2018-01-01\n",
      "2018-01-01\n",
      "false\n",
      "Uniform test method for the measurement of energy efficien\n",
      "{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}\n"
     ]
    }
   ],
   "source": [
    "# Lazy Loading Example\n",
    "pages = []\n",
    "for doc in loader.lazy_load():\n",
    "    pages.append(doc)\n",
    "\n",
    "# Print first 100 characters and metadata of the first page\n",
    "print(pages[0].page_content[:100])\n",
    "print(pages[0].metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "10\n",
      "Energy\n",
      "3\n",
      "2018-01-01\n",
      "2018-01-01\n",
      "false\n",
      "Uniform test method for the measurement of energy efficien\n",
      "{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}\n"
     ]
    }
   ],
   "source": [
    "# Async Loading Example\n",
    "pages = []\n",
    "async for doc in loader.alazy_load():\n",
    "    pages.append(doc)\n",
    "\n",
    "# Print first 100 characters and metadata of the first page\n",
    "print(pages[0].page_content[:100])\n",
    "print(pages[0].metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Web-based Document Using Proxies\n",
    "\n",
    "Sometimes you may need to use proxies to bypass IP blocking.\n",
    "\n",
    "To use a proxy, you can pass a proxy dictionary to the loader (and its underlying ```requests``` library).\n",
    "\n",
    "### ⚠️ Warning:\n",
    "- Replace ```{username}```, ```{password}```, and ```proxy.service.com``` with your actual proxy credentials and server information.\n",
    "- Without a valid proxy configuration, errors such as **ProxyError** or **AuthenticationError** may occur."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = WebBaseLoader(\n",
    "   \"https://www.google.com/search?q=parrots\",\n",
    "   proxies={\n",
    "       \"http\": \"http://{username}:{password}:@proxy.service.com:6666/\",\n",
    "       \"https\": \"https://{username}:{password}:@proxy.service.com:6666/\",\n",
    "   },\n",
    "   # Initialize the web loader with proxy settings\n",
    "   # Configure proxy for both HTTP and HTTPS requests\n",
    ")\n",
    "\n",
    "# Load documents using the proxy\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Web Content Loading with ```MarkItDown```\n",
    "\n",
    "Unlike ```WebBaseLoader``` which uses ```BeautifulSoup4``` for sophisticated HTML parsing, ```MarkItDown``` provides a naive but simpler approach to web content loading. It directly fetches web content using HTTP requests and transfrom it into markdown format without detailed parsing capabilities.\n",
    "\n",
    "Below is a basic example of loading web content using ```MarkItDown```:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from markitdown import MarkItDown\n",
    "\n",
    "md = MarkItDown()\n",
    "result = md.convert(\"https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/\")\n",
    "result_text = result.text_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "[![](https://techcrunch.com/wp-content/uploads/2024/09/tc-lockup.svg) TechCrunch Desktop Logo](https://techcrunch.com)\n",
      "\n",
      "[![](https://techcrunch.com/wp-content/uploads/2024/09/tc-logo-mobile.svg) TechCrunch Mobile Logo](https://techcrunch.com)\n",
      "\n",
      "* [Latest](/latest/)\n",
      "* [Startups](/category/startups/)\n",
      "* [Venture](/category/venture/)\n",
      "* [Apple](/tag/apple/)\n",
      "* [Security](/category/security/)\n",
      "* [AI](/category/artificial-intelligence/)\n",
      "* [Apps](/category/apps/)\n",
      "\n",
      "* [Events](/events/)\n",
      "* [Podcasts](/podcasts/)\n",
      "* [Newsletters](/newsletters/)\n",
      "\n",
      "[Sign In](https://oidc.techcrunch.com/login/?dest=https%3A%2F%2Ftechcrunch.com%2F2024%2F12%2F28%2Frevisiting-the-biggest-moments-in-the-space-industry-in-2024%2F)\n",
      "[![]()](https://techcrunch.com/my-account/)\n",
      "\n",
      "SearchSubmit\n",
      "\n",
      "Site Search Toggle\n",
      "\n",
      "Mega Menu Toggle\n",
      "\n",
      "### Topics\n",
      "\n",
      "[Latest](/latest/)\n",
      "\n",
      "[AI](/category/artificial-intelligence/)\n",
      "\n",
      "[Amazon](/tag/amazon/)\n",
      "\n",
      "[Apps](/category/apps/)\n",
      "\n",
      "[Biotech & Health](/category/biotech-health/)\n",
      "\n",
      "[Climate](/category/climate/)\n",
      "\n",
      "[\n"
     ]
    }
   ],
   "source": [
    "print(result_text[:1000])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-opentutorial-RXtDr8w5-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
